摘要 :
In the past decade, XML has emerged as the standard language for information exchanging over the Internet. Due to its tree-structure paradigm, XML is superior for its capability of storing, querying, and manipulating complex data....
展开
In the past decade, XML has emerged as the standard language for information exchanging over the Internet. Due to its tree-structure paradigm, XML is superior for its capability of storing, querying, and manipulating complex data. Therefore, discovering frequent tree patterns over tree-structured data has become an interesting topic for XML data management. In this paper, we propose a tree mining algorithm, named BUXMiner, for finding a special class of frequent trees, called rooted unordered trees, from a tree-structured database. BUXMiner employs an efficient bottom-up approach to enumerate all candidate trees over a compact global tree guide and computes the frequent trees based on the tree guide. In addition to BUXMiner, we also propose a mining approach called BUMXMiner to discover the maximal frequent rooted unordered trees. We compare BUXMiner with previous treestructure mining algorithms, namely XQPMinerTID and FastXMiner, which were also proposed to discover rooted unordered trees. The experimental results show that our algorithm outperforms XQPMinerTID and FastXMiner in terms of efficiency. The performance results from real-world applications also indicate the usefulness of our proposed tree mining algorithms in a variety of web applications, such as analysis of web page access patterns and mining frequent XML query patterns for caching.
收起
摘要 :
In the past decade, XML has emerged as the standard language for information exchanging over the Internet. Due to its tree-structure paradigm, XML is superior for its capability storing, querying, and manipulating complex data. Th...
展开
In the past decade, XML has emerged as the standard language for information exchanging over the Internet. Due to its tree-structure paradigm, XML is superior for its capability storing, querying, and manipulating complex data. Therefore, discovering frequent tree patterns over tree-structured data has become an interesting topic for XML data management. In this paper, we propose a tree mining algorithm, named BUXMiner. for finding a special class of frequent trees, called rooted unordered trees, from a tree-structured database. BUXMiner employs an efficient bottom-up approach to enumerate all candidate trees over a compact global tree guide and computes the frequent trees based on the tree guide. In addition to BUXMiner, we also propose a mining approach called BUMXMiner to discover the maximal frequent rooted unordered trees. We compare BUXMiner with previous tree-structure mining algorithms, namely XQPMinerTID and FastXMiner, which were also proposed to discover rooted unordered trees. The experimental results show that our algorithm outperforms XQPMinerTID and FastXMiner in terms of efficiency. The performance results from real-world applications also indicate the usefulness of our proposed tree mining algorithms in a variety of web applications, such as analysis of web page access patterns and mining frequent XML query patterns for caching. (c) 2008 Elsevier Inc. All rights reserved.
收起
摘要 :
Mining frequent trees is very useful in domains like bioinformatics, web mining, mining semi-structured data, and so on. In this paper we introduce SLEUTH, an efficient algorithm for mining frequent, unordered, embedded subtrees i...
展开
Mining frequent trees is very useful in domains like bioinformatics, web mining, mining semi-structured data, and so on. In this paper we introduce SLEUTH, an efficient algorithm for mining frequent, unordered, embedded subtrees in a database of labeled trees. The key contributions of our work are as follows: We give the first algorithm that enumerates all embedded, unordered trees. We propose a new equivalence class extension scheme to generate all candidate trees. We extend the notion of scope-list joins to compute frequency of unordered trees. We conduct performance evaluation on several synthetic and real datasets to show that SLEUTH is an efficient algorithm, which has performance comparable to TreeMiner, that mines only ordered trees.
收起
摘要 :
Frequent pattern mining is an important data mining task with a broad range of applications. Initially focused on the discovery of frequent itemsets, studies were extended to mine structural forms like sequences, trees or graphs. ...
展开
Frequent pattern mining is an important data mining task with a broad range of applications. Initially focused on the discovery of frequent itemsets, studies were extended to mine structural forms like sequences, trees or graphs. In this paper, we introduce a new domain of patterns, attributed trees (atrees), and a method to extract these patterns in a forest of atrees. Attributed trees are trees in which vertices are associated with itemsets. Mining this type of patterns (called asubtrees), which combines tree mining and itemset mining, requires the exploration of a huge search space. To make our approach scalable, we investigate the mining of condensed representations. For attributed trees, the classical concept of closure involves both itemset closure and structural closure. We present three algorithms for mining all patterns, closed patterns w.r.t. itemsets (content) and/or structure in attributed trees. We show that, for low support values, mining content-closed attributed trees is a good compromise between non-redundancy of solutions and execution time.
收起
摘要 :
Frequent itemsets are itemsets that occur frequently in a dataset. Frequent itemset mining extracts specific itemsets with supports higher than or equal to a minimum support threshold. Many mining methods have been proposed but Ap...
展开
Frequent itemsets are itemsets that occur frequently in a dataset. Frequent itemset mining extracts specific itemsets with supports higher than or equal to a minimum support threshold. Many mining methods have been proposed but Apriori and FP-growth are still regarded as two prominent algorithms. The performance of the frequent itemset mining depends on many factors; one of them is searching the nodes while constructing the tree. This paper introduces a new prefix-tree structure called child structured frequent pattern tree (CSFP-tree), an FP-tree attached with a child search subtree to each node. The experimental results reveal that the CSFP-tree is superior to the FP-tree and its new variations for any kind of datasets.
收起
摘要 :
Caching query results is one efficient approach to improving the performance of XML management systems. This entails the discovery of frequent XML queries issued by users. In this paper, we model user queries as a stream of XML qu...
展开
Caching query results is one efficient approach to improving the performance of XML management systems. This entails the discovery of frequent XML queries issued by users. In this paper, we model user queries as a stream of XML query pattern trees and mine the frequent query patterns over the query stream. To facilitate the one-pass mining process, we devise a novel data structure called DTS to summarize the pattern trees seen so far. By grouping the incoming pattern trees into batches, we can dynamically mark the active portion of the current batch in DTS and limit the enumeration of candidate trees to only the currently active pattern trees. We also design another summary data structure called ECTree that provides for the incremental computation of the frequent tree patterns over the query stream. Based on the above two constructs, we present two mining algorithms called XQSMinerl and XQSMinerⅡ. XQSMinerl is fast, but it tends to overestimate, while XQSMinerⅡ adopts a filter-and-refine approach to minimize the amount of overestimation. Experimental results show that the proposed methods are both efficient and scalable and require only small memory footprints.
收起
摘要 :
Discovering specified patterns in a time series database has expected much consideration and is nowadays a comparatively mature field. Existing Periodic Pattern Mining algorithms concentrates on mining which involves subsequences....
展开
Discovering specified patterns in a time series database has expected much consideration and is nowadays a comparatively mature field. Existing Periodic Pattern Mining algorithms concentrates on mining which involves subsequences. However, huge portion of requests for example, genetic DNA and protein pattern mining requires estimated patterns that are adjacent in nature. The existing algorithms applied to discover such estimated pattern mining comprises of complicated problems such as deprived scalability and complexity while applying towards certain other applications. To overcome these limitations, a novel technique is presented that evolves a set of periodic pattern if the regularity of the occurrence changes from that estimated pattern. The technique is based on the combination of both suffix and prefix tree patterns, to develop a multiplex tree pruning, for an activity normalized time periodicity data sequences. The integrative sequence of prefix and suffix trees is based on the threshold factor of predominant data pattern occurrence rate. The conceptual model of multiplex tree pruning technique presented in this study, in combination with the prefix and suffix tree model for pruning items identifies the regularity of all observed patterns in an efficient manner. The detailed experimental study shows strong gains in periodic pattern mining, ensure fast storage of all the time series for a specified item. Empirical studies with varied time series data obtained from bank and car data set using UCI repositories is measured and evaluated in terms of time efficiency of pruning patterns of interest, sensitivity and accuracy.
收起
摘要 :
Frequent pattern mining (FPM) is an important data mining paradigm to extract informative patterns like itemsets, sequences, trees, and graphs. However, no practical framework for integrating the FPM tasks has been attempted. In t...
展开
Frequent pattern mining (FPM) is an important data mining paradigm to extract informative patterns like itemsets, sequences, trees, and graphs. However, no practical framework for integrating the FPM tasks has been attempted. In this paper, we describe the design and implementation of the Data Mining Template Library (DMTL) for FPM. DMTL utilizes a generic data mining approach, where all aspects of mining are controlled via a set of properties. It uses a novel pattern property hierarchy to define and mine different pattern types. This property hierarchy can be thought of as a systematic characterization of the pattern space, i.e., a meta-pattern specification that allows the analyst to specify new pattern types, by extending this hierarchy. Furthermore, in DMTL all aspects of mining are controlled by a set of different mining properties. For example, the kind of mining approach to use, the kind of data types and formats to mine over, the kind of back-end storage manager to use, are all specified as a list of properties. This provides tremendous flexibility to customize the toolkit for various applications. Flexibility of the toolkit is exemplified by the ease with which support for a new pattern can be added. Experiments on synthetic and public dataset are conducted to demonstrate the scalability provided by the persistent back-end in the library. DMTL been publicly released as open-source software (http://dmtl.sourceforge.net/), and has been downloaded by numerous researchers from all over the world.
收起
摘要 :
Most work on pattern mining focuses on simple data structures such as itemsets and sequences of itemsets. However, a lot of recent applications dealing with complex data like chemical compounds, protein structures, XML and Web log...
展开
Most work on pattern mining focuses on simple data structures such as itemsets and sequences of itemsets. However, a lot of recent applications dealing with complex data like chemical compounds, protein structures, XML and Web log databases and social networks, require much more sophisticated data structures such as trees and graphs. In these contexts, interesting patterns involve not only frequent object values (labels) appearing in the graphs (or trees) but also frequent specific topologies found in these structures. Recently, several techniques for tree and graph mining have been proposed in the literature. In this paper, we focus on constraint-based tree pattern mining. We propose to use tree automata as a mechanism to specify user constraints over tree patterns. We present the algorithm CoBMiner which allows user constraints specified by a tree automata to be incorporated in the mining process. An extensive set of experiments executed over synthetic and real data (XML documents and Web usage logs) allows us to conclude that incorporating constraints during the mining process is far more effective than filtering the interesting patterns after the mining process.
收起
摘要 :
In this paper, we explore a new sequence pattern technique called BC-WAPT (Binary coded Web Access pattern Tree).it eliminates recursive reconstruction of intermediate WAP tree during the mining by assigning the binary codes to ea...
展开
In this paper, we explore a new sequence pattern technique called BC-WAPT (Binary coded Web Access pattern Tree).it eliminates recursive reconstruction of intermediate WAP tree during the mining by assigning the binary codes to each node in the WAPTree. Sequential Pattern mining is the process of applying data mining techniques to a sequential database for the purposes of discovering the correlation relationships that exist among an ordered list of events. Web access pattern tree (WAP-tree) mining is a sequential pattern mining technique for web log access sequences, which first stores the original web access sequence database on a prefix tree, similar to the frequent pattern tree (FP-tree) for storing non-sequential data. WAP-tree algorithm then, mines the frequent sequences from the WAP-tree by recursively re-constructing intermediate trees, starting with suffix sequences and ending with prefix sequences. An attempt has been made to modify WAP tree approach for improving efficiency. BCWAPT totally eliminates the need to engage in numerous reconstruction of intermediate WAP-trees during mining and considerably reduces execution time.
收起